Learning Disentangled Representations of
Timbre and Pitch for Musical Instrument Sounds Using
Gaussian Mixture Variational Autoencoders

Supplementary Audio Files and Code

Yin-Jyun Luo1, Kat Agres2,3, Dorien Herremans1,2

1Singapore University of Technology and Design
2Institute of High Performance Computing, A*STAR, Singapore
3Yong Siew Toh Conservatory of Music, National University of Singapore
yinjyun_luo@mymail.sutd.edu.sg, kat_agres@ihpc.astar.edu.sg, dorien_herremans@sutd.edu.sg

Controllable Synthesis of Instrument Sounds Given Pitch and Instrument


In this section, we complement the paper with the following synthesized Mel-spectrograms and the corresponding audio files.

We generate the Mel-spectrograms as described in the paper and use the Griffin-Lim algorithm to synthesize the waveforms. Note that audio quality is not the focus of this paper; the inferior quality is mainly due to the algorithm used to synthesize the waveforms, as the original Mel-spectrograms and the generated ones yield similar audio quality. In future work, we will address this by using advanced autoregressive networks such as WaveNet for audio synthesis.

First, we present audio synthesized from the original Mel-spectrogram. Specifically, we convert a sample to a Mel-spectrogram and resynthesize it back to audio using Griffin-Lim. This gives a reference for the audio quality obtained with Griffin-Lim in this work.

French horn
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Piano
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Cello
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Bassoon
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim

Now we demonstrate the controllable sound synthesis. As described in Section 4.3 of the paper, we specify the target pitch $y_m$ and instrument $y_k$, and sample the pitch code $z_p$ and timbre code $z_t$ from the conditional distributions $p(z_p \mid y_m)$ and $p(z_t \mid y_k)$, respectively, where $p(z_p \mid y_p) = \mathcal{N}(\mu_{y_p}, \mathrm{diag}(\sigma_{y_p}))$ and $p(z_t \mid y_t) = \mathcal{N}(\mu_{y_t}, \mathrm{diag}(\sigma_{y_t}))$. In the following demonstration, we specify the same pitches for all instruments, play the audio, and display the corresponding Mel-spectrograms.
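The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the component means and scales are made-up placeholder values, the latent size of 4 is arbitrary, and treating the entries of diag(σ) as per-dimension standard deviations is an assumption. The helper names `sample_diag_gaussian` and `sample_latents` are hypothetical, and the standard-library `random` module is used only to keep the example self-contained.

```python
import random

def sample_diag_gaussian(mu, sigma, rng):
    """Draw one sample from a Gaussian with diagonal covariance.

    Assumption: sigma holds per-dimension standard deviations.
    """
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]

# Hypothetical mixture components (mean, scale) with placeholder values.
# In the model, these parameters would be learned per pitch and per instrument.
pitch_components = {
    "C4": ([0.0, 1.0, -1.0, 0.5], [0.1, 0.1, 0.1, 0.1]),
    "G4": ([1.0, -0.5, 0.0, 2.0], [0.1, 0.1, 0.1, 0.1]),
}
timbre_components = {
    "piano": ([2.0, 0.0, 1.0, -1.0], [0.2, 0.2, 0.2, 0.2]),
    "cello": ([-1.0, 1.5, 0.0, 0.5], [0.2, 0.2, 0.2, 0.2]),
}

def sample_latents(pitch, instrument, rng):
    """Sample a pitch code and a timbre code, then concatenate them."""
    mu_p, sigma_p = pitch_components[pitch]
    mu_t, sigma_t = timbre_components[instrument]
    z_p = sample_diag_gaussian(mu_p, sigma_p, rng)
    z_t = sample_diag_gaussian(mu_t, sigma_t, rng)
    return z_p + z_t  # list concatenation: the decoder input [z_p, z_t]

rng = random.Random(0)  # fixed seed for reproducibility
z = sample_latents("C4", "piano", rng)
```

Holding the pitch label fixed while changing the instrument label, as in the demonstration below, changes only the timbre half of the decoder input.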

English horn
French horn
Tenor Trombone
Trumpet
Piano
Violin
Cello
Saxophone
Bassoon
Clarinet
Flute
Oboe

Many-to-Many timbre transfer


In this section, we demonstrate the model's applicability in timbre transfer.

As described in Section 4.4 of the paper, we first infer $z_p$ and $z_t$ of the source input, and modify $z_t$ (denoted as $z_{\text{source}}$) by:

$$z_{\text{transfer}} = z_{\text{source}} + \alpha \, \mu_{\text{source} \to \text{target}},$$

where $\mu_{\text{source} \to \text{target}} = \mu_{\text{target}} - \mu_{\text{source}}$ and $\alpha \in [0, 1]$. We then synthesize the spectrogram by passing $[z_p, z_{\text{transfer}}]$ to the decoder. See Fig. 4 for an illustration of transferring French horn to piano.

Note that, in practice, we do not need labels of the source pitch and instrument for timbre transfer, as the two latent variables are automatically inferred by $q(z_p \mid X)$ and $q(z_t \mid X)$, respectively. $q(y_t \mid X)$ infers the mixture component (source instrument identity) to which $X$ belongs, and $\mu_{\text{source} \to \text{target}}$ is then obtained by subtracting the mean of the source's mixture component from the mean of the target's.
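The transfer rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the component means, the posterior values, and the 3-dimensional timbre code are placeholders, and picking the source component as the most probable label under the posterior over instruments is an assumption about how the inferred component is selected. The function names `infer_component` and `timbre_transfer` are hypothetical.

```python
def infer_component(posterior):
    """Return the label with the highest posterior probability.

    Assumption: the source component is chosen as the argmax of the
    posterior over instrument labels.
    """
    return max(posterior, key=posterior.get)

def timbre_transfer(z_source, mu_source, mu_target, alpha):
    """Shift the timbre code toward the target component mean.

    Computes z_source + alpha * (mu_target - mu_source) per dimension.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [z + alpha * (m_tgt - m_src)
            for z, m_src, m_tgt in zip(z_source, mu_source, mu_target)]

# Placeholder values for illustration only.
component_means = {"Fhn": [1.0, 0.0, -1.0], "Pno": [3.0, 2.0, 1.0]}
posterior = {"Fhn": 0.9, "Pno": 0.1}  # hypothetical posterior for one input
z_source = [1.5, 0.5, -0.5]           # hypothetical inferred timbre code

source = infer_component(posterior)
sweep = [
    timbre_transfer(z_source, component_means[source], component_means["Pno"], a)
    for a in (0.0, 0.25, 0.5, 0.75, 1.0)
]
```

With alpha equal to 0 the timbre code is unchanged, and with alpha equal to 1 it is displaced by the full difference between the two component means; each code in `sweep` would be concatenated with the unchanged pitch code before decoding.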

Following Fig. 5 in the paper, we demonstrate Fhn→Pno, Pno→Vc, Vc→Bn, and Bn→Fhn.
The source instrument is gradually changed to the target instrument by varying $\alpha \in \{0, 0.25, 0.5, 0.75, 1.0\}$.

French horn to piano

C2 mf Fhn
F#2 pp Fhn

Piano to cello


Notice that the model is able to generalize to the pitch G6 which is not within the range of the cello.

G6 pp Pno
D3 pp Pno

Cello to Bassoon


F3 pp Vc
D#4 pp Vc

Bassoon to French horn


D#4 pp Bn
C5 pp Bn

Disentangling the spectral centroid


In this section, we present the effect of a latent traversal along the 13th dimension of $z_t$, which is discussed in Section 4.5 of the paper.
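A latent traversal of this kind can be sketched as below, together with the standard magnitude-weighted definition of the spectral centroid that could be used to measure its effect on a decoded frame. This is a hedged illustration rather than the authors' procedure: the timbre-code size of 16, the all-zero starting code, and the traversal values are assumed placeholders, and the names `traverse_dimension` and `spectral_centroid` are hypothetical.

```python
def traverse_dimension(z_t, dim_index, values):
    """Return copies of z_t with one dimension set to each value in turn."""
    codes = []
    for v in values:
        z_new = list(z_t)      # copy so the original code is untouched
        z_new[dim_index] = v
        codes.append(z_new)
    return codes

def spectral_centroid(freqs, magnitudes):
    """Magnitude-weighted mean frequency of a single spectral frame."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total

# Assumptions: a 16-dimensional timbre code and an arbitrary value range.
z_t = [0.0] * 16
DIM = 12  # the 13th dimension, using zero-based indexing
values = [-3.0, -1.5, 0.0, 1.5, 3.0]

traversal = traverse_dimension(z_t, DIM, values)
```

Each code in `traversal` would be decoded together with a fixed pitch code, and the centroid of the resulting frames compared across the traversal values.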

Ehn-B3-mf
Trop-D5-mf
Pno-C#6-mf
Vn-A4-mf
Bn-A3-mf
Ob-F6-mf